Source files: https://github.com/djlofland/DATA606_F2019/tree/master/FinalProject
In the late 1990’s I worked for an NGO that focused on training doctors, nurses and midwives on women reproductive health issues and health during pregnancy. The NGO developed and tested training programs in very rural areas around the world to teach health care providers how to provide care for women with no easy access to primary health care facilities or doctors.
UNICEF provides data sets with data and trends in both maternal health and outcomes along with indicators by country.
For an understanding of the problem space, see Maternal mortality.
Some key facts they provide:
I will consider Maternal Mortality Rate by country through the features: GDP, Population, Births, Health Spending, and Skilled Attendants during Birth. While factors like GDP and Population are largely out of the control of individual countries, Birthrate, Health Spending, and Skilled Staff at Birth are levers a country could influence through policies and education if these were found to correlate with Maternal Mortality Rates. Note this is an observational study looking for correlations, not causation. Features found to have significant correlation would be candidates for follow-up experimental studies.
The primary data sets I’ll explore are:
Each of these datasets required Tidy processing to clean up. I also join the data where possible by country and year to tease out additional patterns not available within any one data set.
To facilitate joining tables using different Country label conventions, I scraped a look-up table with country names and ISO3 codes which I join to my various datasets.
These datasets explore different years, but have broad overlap between 2000-2015 (every 5 years). My analysis will be somewhat limited by the available data. If I were doing a deeper dive, I would look for more complete data and/or limit my questions to those countries that have a richer set of data. My suspicion is that better country data and reporting may indicate more attention to tackling problems with maternal mortality and consequently, better data might come from countries with better reporting and more on-the-ground efforts.
All datasets are downloaded from primary internet sources and cached locally to ease repeat analysis.
cache_folder <- './data/cache/' # path where we cache csv data files
raw_folder <- './data/raw/' # path wher we cache raw data files we downloadDuring initial research to locate datasets, I found that different datasets included ISO2 (Alpha 2), ISO3 (Alpha 3) and/or descriptive country names. Just in case it’s needed, we need a look-up table to join datasets using different country name conventions.
url <- "https://www.iban.com/country-codes"
cache_fn <- 'country_codes.csv'
isCacheFound <- FALSE
local_cache_fn <- paste(cache_folder, cache_fn, sep = '')
# Check if we have a local copy of the data available to load
if(file.exists(local_cache_fn)) {
msg <- paste('Found cached copy: ', local_cache_fn, sep='')
isCacheFound <- TRUE
}
# Cached ata file found
if (isCacheFound) {
# Load CSV Files
data_df <- read_csv(local_cache_fn, col_names = TRUE)
isDataLoaded <- TRUE
msg <- 'Cached CSV data loaded.'
} else {
# Load Raw Data
data_df <- url %>%
xml2::read_html() %>%
html_nodes(xpath='//*[@id="myTable"]') %>%
html_table()
data_df <- data_df[[1]]
# Cache the processed as CSV for future
write.csv(data_df, local_cache_fn, row.names=FALSE, na="")
}
iso3_codes <- data_df
# Convert year columns into a variable
iso3_codes <- iso3_codes %>%
mutate(ISO3 = `Alpha-3 code`, ISO2 = `Alpha-2 code`) %>%
select(Country, ISO3, ISO2)
print(msg)
## [1] "Cached CSV data loaded."knitr::kable(iso3_codes[1:10, ], booktabs = T, caption = "ISO2 and ISO3 Country Codes. Tidy data, first 10 rows shown. (Source: http://iban.com/country-codes)")| Country | ISO3 | ISO2 |
|---|---|---|
| Afghanistan | AFG | AF |
| Åland Islands | ALA | AX |
| Albania | ALB | AL |
| Algeria | DZA | DZ |
| American Samoa | ASM | AS |
| Andorra | AND | AD |
| Angola | AGO | AO |
| Anguilla | AIA | AI |
| Antarctica | ATA | AQ |
| Antigua and Barbuda | ATG | AG |
Changes in country population might change how we interpret maternal mortality rate. While the rate should theoretically be independent from population (by definition MMR is maternal deaths per 100,000 live births), if there is a large increase or decrease in population, this might impact conditions around pregnancy (nutrition, access to health care, etc) and/or conditions at time of birth (e.g. access to health care). Since MMR is “deaths per live births”, I’ll end up needing population data to help calculate the number of live births.
url <- "https://en.wikipedia.org/wiki/List_of_countries_by_past_and_future_population"
cache_fn <- 'population.csv'
isCacheFound <- FALSE
local_cache_fn <- paste(cache_folder, cache_fn, sep = '')
# Check if we have a local copy of the data available to load
if(file.exists(local_cache_fn)) {
msg <- paste('Found cached copy: ', local_cache_fn, sep='')
isCacheFound <- TRUE
}
# Cached ata file found
if (isCacheFound) {
# Load CSV Files
data_df <- read_csv(local_cache_fn, col_names = TRUE)
isDataLoaded <- TRUE
msg <- 'Cached CSV data loaded.'
} else {
# Load Raw Data
data_df <- url %>%
xml2::read_html() %>%
html_nodes(xpath='//*[@id="mw-content-text"]/div/table[2]') %>%
html_table()
data_df <- data_df[[1]]
# Cache the processed as CSV for future
write.csv(data_df, local_cache_fn, row.names=FALSE, na="")
}
population_df <- data_df
# Convert year columns into a variable
population_df <- population_df %>%
mutate(Country = `Country (or dependent territory)`) %>%
select(Country, `2000`, `2005`, `2010`, `2015`) %>%
gather(Year, Population, -Country)
# Fix datatypes
population_df$Year <- as.numeric(population_df$Year)
# Filter years
population_df <- population_df %>% filter(Year >= 2000 & Year <= 2015) %>% arrange(Country, Year)
population_df$Population <- population_df$Population* 1000
population_df <- population_df %>% left_join(iso3_codes)
print(msg)
## [1] "Cached CSV data loaded."knitr::kable(population_df[1:10, ], booktabs = T, caption = "Population Data. Tidy data, first 10 rows shown. (Source: https://en.wikipedia.org/wiki/List_of_countries_by_past_and_future_population)")| Country | Year | Population | ISO3 | ISO2 |
|---|---|---|---|---|
| Afghanistan | 2000 | 22462000 | AFG | AF |
| Afghanistan | 2005 | 26335000 | AFG | AF |
| Afghanistan | 2010 | 29121000 | AFG | AF |
| Afghanistan | 2015 | 32565000 | AFG | AF |
| Albania | 2000 | 3159000 | ALB | AL |
| Albania | 2005 | 3025000 | ALB | AL |
| Albania | 2010 | 2987000 | ALB | AL |
| Albania | 2015 | 3030000 | ALB | AL |
| Algeria | 2000 | 30639000 | DZA | DZ |
| Algeria | 2005 | 32918000 | DZA | DZ |
While there is variation between countries, the overall birthrate worldwide has declined. To understand changes in maternal mortality, we might need to estimate how many births are occurring during years of interest in each country. It’s possible that changes in birth rates could have an impact on MMR.
# We first want to check if the data has already been loaded and muched as a local cached .csv file. If the csv file is available, use that. If
# it's not, then we will download and/or load the raw data, munge and save as a local cache .csv.
# Source Data File
source_url <- 'http://api.worldbank.org/v2/en/indicator/SP.DYN.CBRT.IN?downloadformat=excel'
raw_fn <- 'API_SP.DYN.CBRT.IN_DS2_en_excel_v2_248743.xlsx'
cache_fn <- 'births.csv'
# Which WorkSheet to load and there are some extra rows above & below our table of interest.
data_header_rows <- 3 # rows at top to skip (note, blank rows are automatically skipped)
data_tail_rows <- 0 # rows at bottom to skip
data_header <- FALSE
data_sheet <- 1 # optional for Excel Workbooks
data_column_names <- c('Country', 'ISO3', 'Indicator', 'Indicator_Code', as.character(seq(1960, 2017, 1)))
# Some variables to help with flow control
isCacheFound <- FALSE
isRawFound <- FALSE
isDataLoaded <- FALSE
local_raw_fn <- paste(raw_folder, raw_fn, sep = '')
local_cache_fn <- paste(cache_folder, cache_fn, sep = '')
# Check if we have a local copy of the data available to load
if(file.exists(local_cache_fn)) {
msg <- paste('Found cached copy: ', local_cache_fn, sep='')
isCacheFound <- TRUE
} else if(file.exists(local_raw_fn)) {
msg <- paste('Found raw copy: ', local_raw_fn, ' (May need processing)', sep = '')
isRawFound <- TRUE
} else {
# Attempt to download the dataset
download.file(source_url, local_raw_fn, method="auto")
if(file.size(local_raw_fn) > 0) {
msg <- paste('Downloaded file from: ', source_url, sep='')
isRawFound <- TRUE
} else {
msg <- paste("File not found and couldn't be downloaded. Check file name and/or source.")
}
}
print(msg)
## [1] "Found cached copy: ./data/cache/births.csv"
# Cached data file found
if (isCacheFound) {
# Load CSV Files
data_df <- read_csv(paste('data/cache/', cache_fn, sep = ''), col_names = TRUE)
isDataLoaded <- TRUE
msg <- 'Cached CSV data loaded.'
} else if (isRawFound) {
# Load Raw File - Make sure it's excel since read.xlsx will break with other file formats
if (str_detect(local_raw_fn, '.*\\.xlsx?$')) {
data_df <- read.xlsx(local_raw_fn, sheetIndex = data_sheet, header=data_header)
isDataLoaded <- TRUE
msg <- 'Raw data loaded.'
##### DATA PROCESSING start #####
# Are there any extraneous row at to top to remove?
if (data_header_rows > 0) {
data_df <- data_df[(data_header_rows + 1):nrow(data_df),]
data_header_rows <- 0
rownames(data_df) <- NULL
}
# Did the raw file have any extraneous rows at the bottom to remove?
if (data_tail_rows > 0) {
last_row <- nrow(data_df)-data_tail_rows
data_df <- data_df[0:last_row,]
data_tail_rows <- 0
rownames(data_df) <- NULL
}
# Remove any empty columns in the raw data
data_df <- data_df %>% select_if(function(x) {!all(is.na(x))})
# Set the Column names
if (length(data_column_names) > 0) {
names(data_df) <- data_column_names
data_column_names <- list()
}
# Convert to a tibble
data_df <- as_tibble(data_df)
##### DATA PROCESSING done #####
# Cache the processed as CSV for future
write.csv(data_df, local_cache_fn, row.names=FALSE, na="")
msg <- 'Raw data loaded, processed and saved to cache.'
data_df <- read_csv(paste('data/cache/', cache_fn, sep = ''), col_names = TRUE)
}
} else {
# Bummer - no load data file was found to load
msg <- 'Data NOT loaded.'
}
birthrate_df <- data_df
# Convert year columns into a variable
birthrate_df <- birthrate_df %>%
select(Country, `2000`, `2005`, `2010`, `2015`) %>%
gather(Year, BirthRate, -Country) %>%
arrange(Country, Year) %>%
inner_join(iso3_codes)
# Fix datatypes
birthrate_df$Year <- as.numeric(birthrate_df$Year)
print(msg)
## [1] "Cached CSV data loaded."knitr::kable(birthrate_df[1:10, ], booktabs = T, caption = "Birthrate Data. Tidy data, first 10 rows shown. Births per 1000 people. (Source: http://api.worldbank.org/v2/en/indicator/SP.DYN.CBRT.IN?downloadformat=csv)")| Country | Year | BirthRate | ISO3 | ISO2 |
|---|---|---|---|---|
| Afghanistan | 2000 | 48.021 | AFG | AF |
| Afghanistan | 2005 | 44.723 | AFG | AF |
| Afghanistan | 2010 | 39.829 | AFG | AF |
| Afghanistan | 2015 | 34.809 | AFG | AF |
| Albania | 2000 | 16.436 | ALB | AL |
| Albania | 2005 | 12.821 | ALB | AL |
| Albania | 2010 | 12.001 | ALB | AL |
| Albania | 2015 | 12.197 | ALB | AL |
| Algeria | 2000 | 19.554 | DZA | DZ |
| Algeria | 2005 | 20.774 | DZA | DZ |
This is the primary data set to which everything else will be linked. Here we see the maternal mortality rate provided as deaths per 100,000 live births broken out by country and year.
# We first want to check if the data has already been loaded and muched as a local cached .csv file. If the csv file is available, use that. If
# it's not, then we will download and/or load the raw data, munge and save as a local cache .csv.
# Source Data File
source_url <- 'https://data.unicef.org/wp-content/uploads/2015/11/MMR-trend-estimates-2000-2017_MMEIG-2.xlsx'
raw_fn <- 'MMR-trend-estimates-2000-2017_MMEIG-2.xlsx'
cache_fn <- 'mmr_data.csv'
# Which WorkSheet to load and there are some extra rows above & below our table of interest.
data_header_rows <- 4 # rows at top to skip (note, blank rows are automatically skipped)
data_tail_rows <- 17 # rows at bottom to skip
data_header <- FALSE
data_sheet <- 1 # optional for Excel Workbooks
data_column_names <- c('ISO3','Country','2000','2005','2010','2015', '2017')
# Some variables to help with flow control
isCacheFound <- FALSE
isRawFound <- FALSE
isDataLoaded <- FALSE
local_raw_fn <- paste('data/raw/', raw_fn, sep = '')
local_cache_fn <- paste(cache_folder, cache_fn, sep = '')
# Check if we have a local copy of the data available to load
if(file.exists(local_cache_fn)) {
msg <- paste('Found cached copy: ', local_cache_fn, sep='')
isCacheFound <- TRUE
} else if(file.exists(local_raw_fn)) {
msg <- paste('Found raw copy: ', local_raw_fn, ' (May need processing)', sep = '')
isRawFound <- TRUE
} else {
# Attempt to download the dataset
download.file(source_url, local_raw_fn, method="auto")
if(file.size(local_raw_fn) > 0) {
msg <- paste('Downloaded file from: ', source_url, sep='')
isRawFound <- TRUE
} else {
msg <- paste("File not found and couldn't be downloaded. Check file name and/or source.")
}
}
print(msg)
## [1] "Found cached copy: ./data/cache/mmr_data.csv"
# Cached data file found
if (isCacheFound) {
# Load CSV Files
data_df <- read_csv(paste('data/cache/', cache_fn, sep = ''), col_names = TRUE)
isDataLoaded <- TRUE
msg <- 'Cached CSV data loaded.'
} else if (isRawFound) {
# Load Raw File - Make sure it's excel since read.xlsx will break with other file formats
if (str_detect(local_raw_fn, '.*\\.xlsx?$')) {
data_df <- read.xlsx(local_raw_fn, sheetIndex = data_sheet, header=data_header)
isDataLoaded <- TRUE
msg <- 'Raw data loaded.'
##### DATA PROCESSING start #####
# Are there any extraneous row at to top to remove?
if (data_header_rows > 0) {
data_df <- data_df[(data_header_rows + 1):nrow(data_df),]
data_header_rows <- 0
rownames(data_df) <- NULL
}
# Did the raw file have any extraneous rows at the bottom to remove?
if (data_tail_rows > 0) {
last_row <- nrow(data_df)-data_tail_rows
data_df <- data_df[0:last_row,]
data_tail_rows <- 0
rownames(data_df) <- NULL
}
# Remove any empty columns in the raw data
data_df <- data_df %>% select_if(function(x) {!all(is.na(x))})
# Set the Column names
if (length(data_column_names) > 0) {
names(data_df) <- data_column_names
data_column_names <- list()
}
# Convert to a tibble
data_df <- as_tibble(data_df)
##### DATA PROCESSING done #####
# Cache the processed as CSV for future
write.csv(data_df, local_cache_fn, row.names=FALSE, na="")
msg <- 'Raw data loaded, processed and saved to cache.'
data_df <- read_csv(paste('data/cache/', cache_fn, sep = ''), col_names = TRUE)
}
} else {
# Bummer - no load data file was found to load
msg <- 'Data NOT loaded.'
}
mmr_df <- data_df
# Convert year columns into a variable
mmr_df <- mmr_df %>%
gather(Year, MMR, -ISO3, -Country)
# Fix datatypes
mmr_df$Year <- as.numeric(mmr_df$Year)
# Filter years
mmr_df <- mmr_df %>% filter(Year <= 2015) %>% arrange(Country, Year)
print(msg)
## [1] "Cached CSV data loaded."knitr::kable(mmr_df[1:10, ], booktabs = T, caption = "Maternal Mortality Data. Count of maternal deaths per 100,000 live births. Tidy data, first 10 rows shown. Years included: 2000, 2005, 2010, 2015. (Source: https://data.unicef.org/resources/data set/maternal-health-data/)")| ISO3 | Country | Year | MMR |
|---|---|---|---|
| AFG | Afghanistan | 2000 | 1450 |
| AFG | Afghanistan | 2005 | 1140 |
| AFG | Afghanistan | 2010 | 954 |
| AFG | Afghanistan | 2015 | 701 |
| ALB | Albania | 2000 | 23 |
| ALB | Albania | 2005 | 22 |
| ALB | Albania | 2010 | 21 |
| ALB | Albania | 2015 | 15 |
| DZA | Algeria | 2000 | 161 |
| DZA | Algeria | 2005 | 127 |
UNICEF provides country level data describing a number of different indicators related to pregnancy and births. The downloaded raw data set (Excel file) includes a number of different indicator tabs, “Skilled Attendant at Birth” (SAB) being only one of many. We may be interested in how the presence of a trained attendant might correlate with the more central question of maternal mortality during birth. UNICEF don’t report for all countries, only a subset, so that may limit analysis to specific countries.
# We first want to check if the data has already been loaded and muched as a local cached .csv file. If the csv file is available, use that. If
# it's not, then we will download and/or load the raw data, munge and save as a local cache .csv.
# Source Data File
source_url <- 'https://data.unicef.org/wp-content/uploads/2018/07/maternal_health_adolescents_indicators_April-2016_250d599.xlsx'
raw_fn <- 'maternal_health_adolescents_indicators_April-2016_250d599.xlsx'
cache_fn <- 'sab_indicators.csv'
# Which WorkSheet to load and there are some extra rows above & below our table of interest.
data_header_rows <- 8 # rows at top to skip (note, blank rows are automatically skipped)
data_tail_rows <- 9 # rows at bottom to skip
data_header <- FALSE
data_sheet <- 5 # optional for Excel Workbooks
data_column_names <- c('ISO3','Country','Year','Total','Age_1517','Age_1819', 'Age_lt20', 'Age_gt20', 'Age_2034', 'Age_3549', 'Source', 'SourceYear')
# Some variables to help with flow control
isCacheFound <- FALSE
isRawFound <- FALSE
isDataLoaded <- FALSE
local_raw_fn <- paste(raw_folder, raw_fn, sep = '')
local_cache_fn <- paste(cache_folder, cache_fn, sep = '')
# Check if we have a local copy of the data available to load
if(file.exists(local_cache_fn)) {
msg <- paste('Found cached copy: ', local_cache_fn, sep='')
isCacheFound <- TRUE
} else if(file.exists(local_raw_fn)) {
msg <- paste('Found raw copy: ', local_raw_fn, ' (May need processing)', sep = '')
isRawFound <- TRUE
} else {
# Attempt to download the dataset
download.file(source_url, local_raw_fn, method="auto")
if(file.size(local_raw_fn) > 0) {
msg <- paste('Downloaded file from: ', source_url, sep='')
isRawFound <- TRUE
} else {
msg <- paste("File not found and couldn't be downloaded. Check file name and/or source.")
}
}
print(msg)
## [1] "Found cached copy: ./data/cache/sab_indicators.csv"
# Cached data file found
if (isCacheFound) {
# Load CSV Files
data_df <- read_csv(paste('data/cache/', cache_fn, sep = ''), col_names = TRUE)
isDataLoaded <- TRUE
msg <- 'Cached CSV data loaded.'
} else if (isRawFound) {
# Load Raw File - Make sure it's excel since read.xlsx will break with other file formats
if (str_detect(local_raw_fn, '.*\\.xlsx?$')) {
data_df <- read.xlsx(local_raw_fn, sheetIndex = data_sheet, header=data_header)
isDataLoaded <- TRUE
msg <- 'Raw data loaded.'
##### DATA PROCESSING start #####
# Are there any extraneous row at to top to remove?
if (data_header_rows > 0) {
data_df <- data_df[(data_header_rows + 1):nrow(data_df),]
data_header_rows <- 0
rownames(data_df) <- NULL
}
# Did the raw file have any extraneous rows at the bottom to remove?
if (data_tail_rows > 0) {
last_row <- nrow(data_df)-data_tail_rows
data_df <- data_df[0:last_row,]
data_tail_rows <- 0
rownames(data_df) <- NULL
}
# Remove any empty columns in the raw data
data_df <- data_df %>% select_if(function(x) {!all(is.na(x))})
# Set the Column names
if (length(data_column_names) > 0) {
names(data_df) <- data_column_names
data_column_names <- list()
}
# Convert to a tibble
data_df <- as_tibble(data_df)
##### DATA PROCESSING done #####
# Cache the processed as CSV for future
write.csv(data_df, local_cache_fn, row.names=FALSE, na="")
msg <- 'Raw data loaded, processed and saved to cache.'
data_df <- read_csv(paste('data/cache/', cache_fn, sep = ''), col_names = TRUE)
}
} else {
# Bummer - no load data file was found to load
msg <- 'Data NOT loaded.'
}
sab_df <- data_df
# While this is a wide dataset and it's trivial to gather() the age data, for my questions, I really don't need age. Since my MMR, Population and NHA data do not provide Age breakouts, I will only focus on the Total percentage.
sab_df <- sab_df %>% select(ISO3, Country, Year, Total, Source)
# Fix datatypes
sab_df$Year <- as.numeric(sab_df$Year)
sab_df$Total <- as.numeric(sab_df$Total)
# Since there are two different sources of data, we might have a case where they report for the same year. I'm going to group_by() and average the totals in the case where we have multiple data points.
sab_df <- sab_df %>%
group_by(ISO3, Year) %>%
dplyr::summarize(SAB_Pct = mean(Total, na.rm = TRUE))
sab_df <- sab_df %>%
filter(Year == 2000 | Year == 2005 | Year == 2010 | Year == 2015)
print(msg)
## [1] "Cached CSV data loaded."knitr::kable(sab_df[1:10, ], booktabs = T, caption = "Skilled Attendant at Birth. Tidy data, first 10 rows shown. (Source: https://data.unicef.org/wp-content/uploads/2018/07/maternal_health_adolescents_indicators_April-2016_250d599.xlsx)")| ISO3 | Year | SAB_Pct |
|---|---|---|
| AFG | 2010 | 39.00000 |
| ALB | 2000 | 99.12912 |
| ALB | 2005 | 99.75600 |
| ARM | 2000 | 97.17686 |
| ARM | 2005 | 98.58195 |
| ARM | 2010 | 100.00000 |
| AZE | 2000 | 87.46867 |
| BDI | 2000 | 25.17007 |
| BDI | 2005 | 33.62995 |
| BDI | 2010 | 67.18792 |
WHO provides on online tool to export data from WHO member countries on a variety of measures. After exploring options, I honed in on Health Expenditures. We may be curious whether spending on Health has any noticeable correlation with maternal mortality outcomes. I manually downloaded the NHA_indicators.xlsx from their online tool.
reference: http://apps.who.int/nha/database/Home/Index/en source: http://apps.who.int/nha/database/ViewData/Indicators/en
# reference: http://apps.who.int/nha/database/Home/Index/en
# source: http://apps.who.int/nha/database/ViewData/Indicators/en
# We first want to check if the data has already been loaded and muched as a local cached .csv file. If the csv file is available, use that. If
# it's not, then we will download and/or load the raw data, munge and save as a local cache .csv.
# Source Data File
#source_url <- 'https://data.unicef.org/wp-content/uploads/2018/07/maternal_health_adolescents_indicators_April-2016_250d599.xlsx'
raw_fn <- 'NHA_indicators.xlsx'
cache_fn <- 'nha_indicators.csv'
# Which WorkSheet to load and there are some extra rows above & below our table of interest.
data_header_rows <- 2 # rows at top to skip (note, blank rows are automatically skipped)
data_tail_rows <- 0 # rows at bottom to skip
data_header <- FALSE
data_sheet <- 1 # optional for Excel Workbooks
data_column_names <- c('Country','Indicator','Note','Year_2000','Year_2005','Year_2010', 'Year_2015')
# Some variables to help with flow control
isCacheFound <- FALSE
isRawFound <- FALSE
isDataLoaded <- FALSE
local_raw_fn <- paste(raw_folder, raw_fn, sep = '')
local_cache_fn <- paste(cache_folder, cache_fn, sep = '')
# Check if we have a local copy of the data available to load
if(file.exists(local_cache_fn)) {
msg <- paste('Found cached copy: ', local_cache_fn, sep='')
isCacheFound <- TRUE
} else if(file.exists(local_raw_fn)) {
msg <- paste('Found raw copy: ', local_raw_fn, ' (May need processing)', sep = '')
isRawFound <- TRUE
} else {
# Since data came from an online tool, we don't have a specific file we can download
msg <- paste("File not found and couldn't be downloaded. Check file name and/or source.")
}
print(msg)
## [1] "Found cached copy: ./data/cache/nha_indicators.csv"
# Cached data file found
if (isCacheFound) {
# Load CSV Files
data_df <- read_csv(paste('data/cache/', cache_fn, sep = ''), col_names = TRUE)
isDataLoaded <- TRUE
msg <- 'Cached CSV data loaded.'
} else if (isRawFound) {
# Load Raw File - Make sure it's excel since read.xlsx will break with other file formats
if (str_detect(local_raw_fn, '.*\\.xlsx?$')) {
data_df <- read.xlsx(local_raw_fn, sheetIndex = data_sheet, header=data_header)
isDataLoaded <- TRUE
msg <- 'Raw data loaded.'
##### DATA PROCESSING start #####
# Are there any extraneous row at to top to remove?
if (data_header_rows > 0) {
data_df <- data_df[(data_header_rows + 1):nrow(data_df),]
data_header_rows <- 0
rownames(data_df) <- NULL
}
# Did the raw file have any extraneous rows at the bottom to remove?
if (data_tail_rows > 0) {
last_row <- nrow(data_df)-data_tail_rows
data_df <- data_df[0:last_row,]
data_tail_rows <- 0
rownames(data_df) <- NULL
}
# Remove any empty columns in the raw data
data_df <- data_df %>% select_if(function(x) {!all(is.na(x))})
# Set the Column names
if (length(data_column_names) > 0) {
names(data_df) <- data_column_names
data_column_names <- list()
}
# Convert to a tibble
data_df <- as_tibble(data_df)
##### DATA PROCESSING done #####
# Cache the processed as CSV for future
write.csv(data_df, local_cache_fn, row.names=FALSE, na="")
msg <- 'Raw data loaded, processed and saved to cache.'
data_df <- read_csv(paste('data/cache/', cache_fn, sep = ''), col_names = TRUE)
}
} else {
# Bummer - no load data file was found to load
msg <- 'Data NOT loaded.'
}
nha_df <- data_df
print(msg)
## [1] "Cached CSV data loaded."
# Remove to extra Note column we won't need
nha_df <- nha_df %>% select(-Note)
# Convert year columns into a variable
nha_df <- nha_df %>%
gather(Year, Amount, -Country, -Indicator) %>%
mutate(Year = stringr::str_replace(Year, "Year_", "")) %>%
drop_na()
# Fix datatypes
nha_df$Year <- as.numeric(nha_df$Year)
nha_df$Amount <- as.numeric(nha_df$Amount)
# Break out CHE into a separate column
exp_data <- nha_df %>% filter(Indicator == 'Current Health Expenditure (CHE) per Capita in PPP') %>%
mutate(CHE = Amount) %>%
select(-Indicator, -Amount) %>%
arrange(Country, Year) %>%
drop_na()
# Break out GDP into a separate column
gdp_data <- nha_df %>% filter(Indicator == 'Gross Domestic Product') %>%
mutate(GDP = Amount) %>%
select(-Indicator, -Amount) %>%
arrange(Country, Year) %>%
drop_na()
# Join the columns back into a table based on Country & Year
nha_df <- gdp_data %>%
left_join(exp_data) %>%
left_join(iso3_codes)knitr::kable(nha_df[1:10, ], booktabs = T, caption = "WHO Reported Health Expenditures & GDP by Country & Year. Tidy data, first 10 rows shown. Note CHE are adjusted with Purchase Price Parity (PPP) so dollars have equivalent spending power per country. This is also sometimes referred to as the 'Big Mac Index' (Source: http://apps.who.int/nha/database/ViewData/Indicators/en)")| Country | Year | GDP | CHE | ISO3 | ISO2 |
|---|---|---|---|---|---|
| Afghanistan | 2005 | 24851.345 | 98.61208 | AFG | AF |
| Afghanistan | 2010 | 44457.152 | 132.27128 | AFG | AF |
| Albania | 2000 | 11925.958 | 258.28327 | ALB | AL |
| Albania | 2005 | 17663.350 | 363.84904 | ALB | AL |
| Albania | 2010 | 28073.799 | 478.31474 | ALB | AL |
| Algeria | 2000 | 252385.066 | 282.38500 | DZA | DZ |
| Algeria | 2005 | 365236.389 | 354.95766 | DZA | DZ |
| Algeria | 2010 | 455452.366 | 645.28783 | DZA | DZ |
| Andorra | 2000 | 2132.781 | 3049.00618 | AND | AD |
| Andorra | 2005 | 3444.711 | 3741.40393 | AND | AD |
Note: PPP refers to “Price Point Parity”. This country-by-country adjustment reflects different purchasing power in different geo’s and attempts to normalize spending for better comparison across geo’s. We know intuitively that cost of living and items vary greatly around the world and $1 in the US doesn’t NOT have the same purchasing power as $1 in India. PPP normalizes so that spending can be compared across countries.
Note that we are combining data from a number of sources and years. As such there will be a number of NA’s where data is missing in various columns. Here I also calculate the actual number of maternal deaths based on populations, birth rates and MMR rates.
# Use the current population by year * birthrate by year to get actual births
births_df <- birthrate_df %>%
inner_join(population_df) %>%
mutate(Births = round(Population / 1000 * BirthRate)) %>%
select(ISO3, Country, Year, Population, BirthRate, Births)
births_df <- births_df %>% left_join(iso3_codes)
# now that we have births, we can multiply by MMR (deaths per 100,000 births)
mmr_df <- mmr_df %>%
left_join(births_df) %>%
mutate(Deaths = round(Births / 10000 * MMR))
# join in Skilled Attendant at Birth (SAB)
mmr_df <- mmr_df %>%
left_join(sab_df) %>%
left_join(nha_df)knitr::kable(mmr_df[1:10, ], booktabs = T, caption = "Maternal Mortality table with supporting columns.")| ISO3 | Country | Year | MMR | Population | BirthRate | Births | ISO2 | Deaths | SAB_Pct | GDP | CHE |
|---|---|---|---|---|---|---|---|---|---|---|---|
| AFG | Afghanistan | 2000 | 1450 | 22462000 | 48.021 | 1078648 | AF | 156404 | NA | NA | NA |
| AFG | Afghanistan | 2005 | 1140 | 26335000 | 44.723 | 1177780 | AF | 134267 | NA | 24851.34 | 98.61208 |
| AFG | Afghanistan | 2010 | 954 | 29121000 | 39.829 | 1159860 | AF | 110651 | 39.00000 | 44457.15 | 132.27128 |
| AFG | Afghanistan | 2015 | 701 | 32565000 | 34.809 | 1133555 | AF | 79462 | NA | NA | NA |
| ALB | Albania | 2000 | 23 | 3159000 | 16.436 | 51921 | AL | 119 | 99.12912 | 11925.96 | 258.28327 |
| ALB | Albania | 2005 | 22 | 3025000 | 12.821 | 38784 | AL | 85 | 99.75600 | 17663.35 | 363.84904 |
| ALB | Albania | 2010 | 21 | 2987000 | 12.001 | 35847 | AL | 75 | NA | 28073.80 | 478.31474 |
| ALB | Albania | 2015 | 15 | 3030000 | 12.197 | 36957 | AL | 55 | NA | NA | NA |
| DZA | Algeria | 2000 | 161 | 30639000 | 19.554 | 599115 | DZ | 9646 | NA | 252385.07 | 282.38500 |
| DZA | Algeria | 2005 | 127 | 32918000 | 20.774 | 683839 | DZ | 8685 | NA | 365236.39 | 354.95766 |
We generally see a decline in mortality rates from 2000 to 2015 data.
data <- mmr_df %>% filter(Year==2000 | Year==2015) %>% select(MMR, Year)
data$Year <- as.factor(data$Year)
ggplot(data=data, aes(x=MMR, color=Year)) +
geom_histogram(fill='white', alpha=0.2, position="identity") +
scale_color_discrete(name="Year", labels=c(2000, 2015)) + labs(title = "2000 and 2015 Maternal Mortality Rates (MMR)") +
theme(plot.title = element_text(hjust = 0.5))
data <- mmr_df %>%
select(Country, MMR, Year) %>%
rename(region = Country,
value = MMR) %>%
mutate(region = tolower(region)) %>%
mutate(region = recode(region,
"united states" = "united states of america",
"congo, dem. rep." = "democratic republic of the congo",
"congo, rep." = "republic of congo",
"korea, dem. rep." = "south korea",
"korea. rep." = "north korea",
"tanzania" = "united republic of tanzania",
"serbia" = "republic of serbia",
"slovak republic" = "slovakia",
"yemen, rep." = "yemen"))
country_choropleth(data %>% filter(Year == 2000), num_colors = 9) +
scale_fill_brewer(palette="YlOrRd") +
labs(title = paste("MMR by country (", 2000, ")"), fill = "value") +
theme(plot.title = element_text(size = 20, face = "bold", hjust = 0.5))
country_choropleth(data %>% filter(Year == 2005), num_colors = 9) +
scale_fill_brewer(palette="YlOrRd") +
labs(title = paste("MMR by country (", 2005, ")"), fill = "value") +
theme(plot.title = element_text(size = 20, face = "bold", hjust = 0.5))
country_choropleth(data %>% filter(Year == 2010), num_colors = 9) +
scale_fill_brewer(palette="YlOrRd") +
labs(title = paste("MMR by country (", 2010, ")"), fill = "value") +
theme(plot.title = element_text(size = 20, face = "bold", hjust = 0.5))
country_choropleth(data %>% filter(Year == 2015), num_colors = 9) +
scale_fill_brewer(palette="YlOrRd") +
labs(title = paste("MMR by country (", 2015, ")"), fill = "value") +
theme(plot.title = element_text(size = 20, face = "bold", hjust = 0.5))Let’s see which countries have the most reduction in maternal deaths.
# Group data by country and calculate the delta in deaths over the years
tbl <- mmr_df %>%
group_by(Country) %>%
dplyr::summarize(delta = round(first(Deaths) - last(Deaths))) %>%
arrange(desc(delta)) %>%
drop_na()
# barplot
ggplot(tbl %>% top_n(10, delta), aes(x=reorder(Country, delta), y=delta, fill=Country)) +
ggtitle("Top 10 Countries with Reductions in Deaths") +
theme(plot.title = element_text(hjust = 0.5)) +
geom_bar(position='dodge', stat="identity") +
geom_text(aes(reorder(Country, delta), y=delta, label = round(delta)), size = 2, hjust = 1.1, vjust = 0.5 ) +
#scale_y_discrete(labels = comma, breaks=c(1000, 100000, 200000, 600000)) +
ylim(c(0,700000)) +
xlab('Country') +
ylab('Deaths') +
theme(legend.position = "none") +
coord_flip() +
scale_y_continuous(labels = comma)… and those with increases in maternal deaths. Nigeria is clearly experiencing a crisis.
# barplot
ggplot(tbl %>% top_n(-10, delta), aes(x=reorder(Country, delta), y=delta, fill=Country)) +
ggtitle("Bottom 10 Countries with Increases in Deaths") +
theme(plot.title = element_text(hjust = 0.5)) +
geom_bar(position='dodge', stat="identity") +
geom_text(aes(reorder(Country, delta), y=delta, label = round(delta)), size = 2, hjust = 1.5, vjust = 0.5 ) +
scale_y_discrete(labels = comma, breaks=c(-30000, -20000, -1000)) +
ylim(c(-25000,0)) +
xlab('Country') +
ylab('Deaths') +
theme(legend.position = "none") +
coord_flip()data <- tbl %>%
filter(delta > 0) %>%
rename(region = Country,
value = delta) %>%
mutate(region = tolower(region),
value = value) %>%
mutate(region = recode(region,
"united states" = "united states of america",
"congo, dem. rep." = "democratic republic of the congo",
"congo, rep." = "republic of congo",
"korea, dem. rep." = "south korea",
"korea. rep." = "north korea",
"tanzania" = "united republic of tanzania",
"serbia" = "republic of serbia",
"slovak republic" = "slovakia",
"yemen, rep." = "yemen"))
country_choropleth(data, num_colors = 9) +
scale_fill_brewer(palette="Greens") +
labs(title = paste("MMR Improvements (2000-2015)"), fill = "value") +
theme(plot.title = element_text(size = 20, face = "bold", hjust = 0.5))data <- tbl %>%
filter(delta < 0) %>%
rename(region = Country,
value = delta) %>%
mutate(region = tolower(region),
value = -value) %>%
mutate(region = recode(region,
"united states" = "united states of america",
"congo, dem. rep." = "democratic republic of the congo",
"congo, rep." = "republic of congo",
"korea, dem. rep." = "south korea",
"korea. rep." = "north korea",
"tanzania" = "united republic of tanzania",
"serbia" = "republic of serbia",
"slovak republic" = "slovakia",
"yemen, rep." = "yemen"))
country_choropleth(data, num_colors = 9) +
scale_fill_brewer(palette="Reds") +
labs(title = paste("MMR Losses (2000-2015)"), fill = "value") +
theme(plot.title = element_text(size = 20, face = "bold", hjust = 0.5))Let’s start with a simple correlation matrix to understand which features correlate positively with MMR (i.e. increase dates of death) or negatively (decreased rates of death). Note that we are particularly interested in features in RED that negatively correlate with MMR. We see that County Health Expenditures (CHE), Skilled Attendant at Birth (SAB_Pct) and GDP appear as promising features to explore. Births have a positive correlation – more births correlating with a higher rate of moms dying. Since MMR is a rate (deaths per 100k births), we might have assumed that the number of births would have no direct affect on the rate of women dying. Since we cannot attribute causation, we can merely say that in countries with more births, there is an increase in the rate of deaths. Maybe more births tax limited resources? This might be a different research question to explore in a later study. Interestingly, Population doesn’t seem to correlate with MMR. So higher population doesn’t seem to be a factor in explaining MMR.
mmr_data <- mmr_df %>%
select(Population, Births, GDP, CHE, MMR, SAB_Pct) %>%
drop_na()
mmr_cor <- cor(mmr_data, method = c("spearman"))
corrplot(mmr_cor)
palette = colorRampPalette(c("red", "white", "green")) (20)
heatmap(x = mmr_cor, col = palette, symm = TRUE)We see that generally, countries spend less than $4000 (PPP) adjusted dollars per person on health expenditures and as GDP increases, there is an increase in CHE. We some clear outliers with very high GDP countries spending very little and vice versa, countries with relatively low GDP spending reporting quite high CHE. While this is an interesting pattern, I suspect that CHE is not entirely reported nor being spent on the poorest population that might be more impacted by MMR. Unfortunately, I don’t have the data set to fully explore those questions.
ggplot(mmr_df, aes(GDP, CHE)) +
geom_point() +
labs(title='Country Health Expenditure (PPP) vs Country GDP (PPP)') +
xlab('GDP (Millions, PPP)') +
ylab('CHE (Dollars, PPP)') +
scale_x_continuous(labels = comma) +
theme(plot.title = element_text(hjust = 0.5))We do see a pattern where higher reported health expenditures correlate with lower maternal mortality and inversely, those countries (and years) with lower CHE see higher MMR. The inflection point seems to be around $1000. Countries spending less than $1000 appear to have higher rates of MMR.
ggplot(mmr_df, aes(CHE, MMR)) +
geom_point() +
labs(title='Maternal Mortality Rate (MMR) vs Country Health Expenditure (PPP)') +
xlab('CHE (Dollars, PPP)') +
ylab('MMR (per 100k live births)') +
theme(plot.title = element_text(hjust = 0.5))Since MMR is a rate, let’s look at the count of deaths compared with health expenditures. The death count increases most dramatically with lower health expenditures and population impacts. There are a handful of countries high huge number of deaths per year and those correlate with the lowest outlay of health expenditures. If we filter out the outliers, we again see that the inflection point is ~$1000.
ggplot(mmr_df, aes(CHE, Deaths)) +
geom_point() +
labs(title='Maternal Deaths vs Country Health Expenditures (PPP)') +
xlab('CHE (Dollars, PPP)') +
ylab('Maternal Deaths') +
theme(plot.title = element_text(hjust = 0.5))
mmr_df %>%
filter(Deaths < 100000) %>%
ggplot(aes(CHE, Deaths)) +
geom_point() +
labs(title='Maternal Deaths vs Country Health Expenditures (PPP)', subtitle='< 100,000 deaths') +
xlab('CHE (Dollars, PPP)') +
ylab('Maternal Deaths') +
theme(plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5))Generally the Maternal Mortality rate is higher with counties having lower GDP. However, there are plenty of countries with low GDP and low MMR, so clearly GDP doesn’t tell the whole story.
ggplot(mmr_df, aes(GDP, MMR)) +
geom_point() +
labs(title='Maternal Mortality Rate (MMR) vs Country GDP (Millions, PPP)') +
xlab('GDP (Millions, PPP)') +
ylab('MMR (per 100k live births)') +
scale_x_continuous(labels = comma) +
theme(plot.title = element_text(hjust = 0.5))While we don’t have many data points, there does appear to be a trend with lowered maternal mortality when there is a higher percentage of births managed by a health professional. That said, a Skilled Attendant at birth may only be able to do so much … there may be affects earlier in pregnancy leading to mortality.
ggplot(mmr_df %>% drop_na(), aes(SAB_Pct, MMR)) +
geom_point() +
labs(title='Maternal Mortality Rate vs Skilled Attendants at Birth (SAB)',
caption='note, we only have 98 data points with both SAB and MMR')+
xlab('Skilled Attendant at Birth (%)') +
ylab('MMR (per 100k live births)') +
theme(plot.title = element_text(hjust = 0.5))As with mortality rate, we generally see fewer reported deaths as the percent of births are handled by a trained health professional.
ggplot(mmr_df %>% drop_na(), aes(SAB_Pct, Deaths)) +
geom_point() +
scale_y_continuous(labels = comma) +
ylim(0, 300000) +
labs(title='Maternal Deaths vs Skilled Attendants at Birth (SAB)') +
xlab('Skilled Attendant at Birth (%)') +
ylab('Maternal Deaths') +
theme(plot.title = element_text(hjust = 0.5))Let’s build a model with the known features that helps to explain MMR and explore which features might have more influence on outcomes. This type of analysis would be useful for policy makers to design followup experiments that attempt to reduce MMR.I’ll start by exploring each feature separately, then combine and do multiple regression.
Population is highly skewed which isn’t ideal for regression analysis - we log transform Population and run a simple linear regression to assess how much Population influences MMR and whether the effect is significant. Interestingly, Population is a significant factor and as Population increases, MMR increases. However, with an adjusted R\(^2\) of 0.008236, Population explains less than 1% of the variability in the model and the residuals and Q-Q plot suggest that a linear model with Population probably isn’t appropriate. Given this, we won’t consider Population in our multiple regression model.
# Run Linear Regression using Population to predict DeathRate
pop_lm <- lm(MMR ~ log(Population), data = mmr_df)
(pop_lm <- summary(pop_lm))
##
## Call:
## lm(formula = MMR ~ log(Population), data = mmr_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -281.23 -199.69 -141.84 76.82 2273.08
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -61.757 116.009 -0.532 0.595
## log(Population) 17.731 7.336 2.417 0.016 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 323.4 on 582 degrees of freedom
## (156 observations deleted due to missingness)
## Multiple R-squared: 0.009937, Adjusted R-squared: 0.008236
## F-statistic: 5.841 on 1 and 582 DF, p-value: 0.01596
residPlot(pop_lm)As with Population, Births are also highly skewed. We will again use a log transformation as we explore the relationship of Births and MMR. log(Births) is highly significant and as Births increase, MMR increases and in this simple model we see an adjusted R\(^2\) of 0.08013. On it’s own, Births explains < 10% of variability we see in MMR. Of concern is that our residuals are not normally distributed as seen in the Q-Q plot. If we want to try using Births we would need to introduce a non-linear regression. We will not include Births in any combined models.
# Run Linear Regression using Population to predict DeathRate
birth_lm <- lm(MMR ~ log(Births), data = mmr_df)
(birth_lm <- summary(birth_lm))
##
## Call:
## lm(formula = MMR ~ log(Births), data = mmr_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -423.00 -186.05 -109.10 73.43 2252.20
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -362.742 81.551 -4.448 1.04e-05 ***
## log(Births) 49.084 6.821 7.196 1.91e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 311.4 on 582 degrees of freedom
## (156 observations deleted due to missingness)
## Multiple R-squared: 0.08171, Adjusted R-squared: 0.08013
## F-statistic: 51.79 on 1 and 582 DF, p-value: 1.915e-12
residPlot(birth_lm)In our exploratory analysis, SAB_Pct seemed to be highly negatively correlated with MMR. We find that SAB_Pct is highly significant and explains ~53% of variability in MMR. We should note that we had limited data for SAB_Pct, but having skilled attendants at birth appears to be an important feature related to lowered MMR. We note that the residuals have a few bumps in the extremes, but are closer to normal. The Q-Q plot is somewhat linear through the central region. If we were looking to make a predictive model, we might explore removing outliers that are affecting the extremes. For a simple analysis to answer whether SAB_Pct is an important feature, this is probably sufficient.
# Run Linear Regression using Population to predict DeathRate
sab_lm <- lm(MMR ~ SAB_Pct, data = mmr_df)
(sab_lm <- summary(sab_lm))
##
## Call:
## lm(formula = MMR ~ SAB_Pct, data = mmr_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -584.43 -126.32 -28.24 91.64 1716.52
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1276.450 85.427 14.94 <2e-16 ***
## SAB_Pct -12.296 1.179 -10.43 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 310.4 on 96 degrees of freedom
## (642 observations deleted due to missingness)
## Multiple R-squared: 0.5312, Adjusted R-squared: 0.5263
## F-statistic: 108.8 on 1 and 96 DF, p-value: < 2.2e-16
residPlot(sab_lm)Since CHE is highly skewed, we will use the log transform of CHE in our model. We find that in a simple model, log(CHE) is highly significant and as health expenditures increase, MMR decreases. The model showed an adjusted R\(^2\) of 0.4366 meaning that CHE on it’s own accounts for ~44% of the variability we see in MMR by country. In our residuals, we do see some skew and bumps in the upper regions suggesting some outliers. However, the distribution is mostly normal and the Q-Q plot linear through most of the residual points.
# Run Linear Regression using Population to predict DeathRate
che_lm <- lm(MMR ~ log(CHE), data = mmr_df)
(che_lm <- summary(che_lm))
##
## Call:
## lm(formula = MMR ~ log(CHE), data = mmr_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -523.18 -138.63 -30.99 80.83 2078.41
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1161.193 53.900 21.54 <2e-16 ***
## log(CHE) -158.228 8.788 -18.00 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 240 on 416 degrees of freedom
## (322 observations deleted due to missingness)
## Multiple R-squared: 0.438, Adjusted R-squared: 0.4366
## F-statistic: 324.2 on 1 and 416 DF, p-value: < 2.2e-16
residPlot(che_lm)GDP is highly skewed, so we will use the log transform. GDP is a significant factor, but only explains about ~9% of the variability we see in MMR. As a country’s GDP increases, we do see a decrease in MMR. However, the residual plots suggest that a linear treatment might not be appropriate for GDP is we were looking for a true predictive model. For a simple yes/no as to whether GDP is a significant factor, we can probably say yes, but that it doesn’t account for much of the variability.
# Run Linear Regression using Population to predict DeathRate
gdp_lm <- lm(MMR ~ log(GDP), data = mmr_df)
(gdp_lm <- summary(gdp_lm))
##
## Call:
## lm(formula = MMR ~ log(GDP), data = mmr_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -366.5 -176.5 -103.9 97.9 2159.8
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 712.000 78.320 9.091 < 2e-16 ***
## log(GDP) -46.197 7.129 -6.480 2.59e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 304.8 on 417 degrees of freedom
## (321 observations deleted due to missingness)
## Multiple R-squared: 0.09149, Adjusted R-squared: 0.08931
## F-statistic: 41.99 on 1 and 417 DF, p-value: 2.59e-10
residPlot(gdp_lm)Let’s perform a multiple regression combining the features that independently showed significance and seemed appropriate for a linear model: SAB_Pct and CHE. In this model, we explain ~55% of the variability and the model as a whole is very significant with a p-value close to 0. However, we only find SAB_Pct to be significant. We had previously noted that health expenditures (CHE) and Skilled attendants at births (SAB_PCT) were highly auto-correlated. As such, the model has arbitrarily attributed the effects most strongly with SAB_Pct and has treated CHE as not significant. This is a limitation of simple linear regressions.
# Run Linear Regression using Population to predict DeathRate
combined_lm <- lm(MMR ~ SAB_Pct + log(CHE), data = mmr_df)
(combined_lm <- summary(combined_lm))
##
## Call:
## lm(formula = MMR ~ SAB_Pct + log(CHE), data = mmr_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -575.23 -122.94 -17.92 74.31 1677.46
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1240.050 240.568 5.155 2.43e-06 ***
## SAB_Pct -13.819 2.375 -5.818 1.83e-07 ***
## log(CHE) 28.955 69.185 0.419 0.677
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 324.5 on 67 degrees of freedom
## (670 observations deleted due to missingness)
## Multiple R-squared: 0.5584, Adjusted R-squared: 0.5452
## F-statistic: 42.37 on 2 and 67 DF, p-value: 1.281e-12
residPlot(combined_lm)As an exploratory effort, let’s try fitting a Generalized Additive Model to
library(mgcv)
## Loading required package: nlme
##
## Attaching package: 'nlme'
## The following object is masked from 'package:dplyr':
##
## collapse
## This is mgcv 1.8-31. For overview type 'help("mgcv-package")'.
# Build the model
model <- gam(MMR ~ s(SAB_Pct) + s(log(CHE)), data = mmr_df)
summary(model)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## MMR ~ s(SAB_Pct) + s(log(CHE))
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 447.16 38.69 11.56 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(SAB_Pct) 1.000 1.000 34.151 1.21e-07 ***
## s(log(CHE)) 1.265 1.488 0.116 0.741
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.548 Deviance explained = 56.2%
## GCV = 1.0989e+05 Scale est. = 1.0476e+05 n = 70
ggplot(mmr_df, aes(SAB_Pct, MMR) ) +
geom_point() +
stat_smooth(method = gam, formula = y ~ s(x))
ggplot(mmr_df, aes(log(CHE), SAB_Pct) ) +
geom_point() +
stat_smooth(method = gam, formula = y ~ s(x))Of the features explored (Population, GDP, Births, Health Expenditures and Skilled Attendant at Birth), we found that Health Expenditures and Skilled Attendant at Birth have the highest correlation with reduced Maternal Mortality Rates at child birth. These features were both signification and together explained ~55% of the variability we see in MMR across countries. While we cannot attribute causality, modeling would suggest these are areas for further experimental studies. To a less extent, we saw the increased Births had a positive correlation with MMR. It would be worthwhile to explore other data sources or do followup analysis to understand why more births might be associated with higher death rates.